Skip to content

Conversation

antgonza
Copy link
Member

@antgonza antgonza commented Sep 1, 2025

No description provided.

@antgonza antgonza changed the title PacBio [WIP] PacBio Sep 1, 2025
@@ -207,43 +207,6 @@ def test_get_orig_names_from_sheet_with_replicates(self):

self.assertEqual(set(obs), exp)

def test_required_file_checks(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These test were (1) specific to Illumina so they should not be part of Pipeline, (2) duplication of what normally happens in any code: check if a files exists and raise an error; thus, in my opinion, unnecessary.

Copy link
Contributor

@AmandaBirmingham AmandaBirmingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of typos and questions. Happy to look again when tests pass.

self.mandatory_attributes = ['qclient', 'uif_path',
'lane_number', 'config_fp',
'run_identifier', 'output_dir', 'job_id',
'lane_number', 'is_restart']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does lane_number need to be present twice?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake; maybe we found the culprit of the duplicated lane?!

return failed_samples

def generate_sequence_counts(self):
# for other isntances of generate_sequence_counts in other objects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: replace "isntances" with "instances"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads):
"""Generates the bam2fastq commands"""
df = pd.read_csv(sample_list, sep='\t')
files = {f.split('.')[-2]: f
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this split/slice grabbing from the file name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some comments about this.

# add to bucket_size
bucket_size += r1_size + r2_size
current_size += r1_size

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand we are hurrying, but can we add an issue to refactor this copy/paste addition in the future? There is so much shared functionality between the two branches of the if/else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create an issue pointing to what needs to be refactored.


for d in openfps.values():
for f in d.values():
f.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The amount of code duplication between this and the original demux command feels to me like a real invitation to future bugs. Assuming we do not have time to refactor this now, can we have an issue to remind us to do it in the future?

if 'chemistry' in header:
chemistry = header['chemistry']
else:
chemistry = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest replacing with

chemistry = header.get('chemistry', '')

which will have the same effect, be shorter, and remove the need to repeat 'chemistry' :)

mtag = tree.getroot().tag.split('}')[0] + '}'
instrument_text = (
f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
run = ET.parse(run_info).find(instrument_text)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am really uncomfy with this being repeated. Could we make a method something like the below

def get_pacbio_run_str(run_info, run_directory):
            pacbio_metadata_fps = glob(
                f'{run_directory}/*/metadata/*.metadata.xml')
            if not pacbio_metadata_fps:
                raise ValueError(f"'{run_info}' doesn't exist")

            run_info = pacbio_metadata_fps[0]
            tree = ET.parse(run_info)
            mtag = tree.getroot().tag.split('}')[0] + '}'
            instrument_text = (
                f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
            run = ET.parse(run_info).find(instrument_text)
            return run

and then call it in _get_instrument_id like

if not exists(run_info):
            run = get_pacbio_run_str(run_info, run_directory)
            text = run.attrib['TimeStampedName']

and in _get_instrument_type like

if not exists(run_info):
            run = get_pacbio_run_str(run_info, run_directory)
            date_string = run.attrib['CreatedAt'].split('T')[0]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you; this made me realize that we could use the same method and replace: get_date_from_run_id.

Description,,,,,,,
,,,,,,,
[Data],,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, I don't know if you are round-tripping sample sheets at any point, but if you read this in and then write it back out, barcode_id will show up between Sample_Well and Sample_Project, not at the end, due to the forced ordering of the columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know.

makedirs(f'{genprep_dir}/{sp}/filtered_sequences/',
exist_ok=True)

# # then loop over samples and stage all fastq.gz files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra #?

demux-runner
echo "$(date) :: demux stop"

touch ${OUTPUT}/${SLURM_JOB_NAME}.${SLURM_ARRAY_TASK_ID}.completed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like it is almost the same as nuqc_job.sh, with just a few deletions. Is this something we could consider refactoring in the future?

Copy link
Member Author

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AmandaBirmingham for your review. Added this issue: #144.

self.mandatory_attributes = ['qclient', 'uif_path',
'lane_number', 'config_fp',
'run_identifier', 'output_dir', 'job_id',
'lane_number', 'is_restart']
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake; maybe we found the culprit of the duplicated lane?!

return failed_samples

def generate_sequence_counts(self):
# for other isntances of generate_sequence_counts in other objects
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

def generate_bam2fastq_commands(sample_list, run_folder, outdir, threads):
"""Generates the bam2fastq commands"""
df = pd.read_csv(sample_list, sep='\t')
files = {f.split('.')[-2]: f
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some comments about this.

# add to bucket_size
bucket_size += r1_size + r2_size
current_size += r1_size

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create an issue pointing to what needs to be refactored.


def run(self, callback=None):
"""
Run BCL2Fastq/BCLConvert conversion
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, updating.

@@ -52,13 +53,15 @@ def __init__(self, fastq_root_dir, output_path, sample_sheet_path,
:param additional_fastq_tags: A list of fastq tags to preserve during
filtering.
:param files_regex: the FILES_REGEX to use for parsing files
:param read_length: the FILES_REGEX to use for parsing files
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching all these.

Description,,,,,,,
,,,,,,,
[Data],,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Sample_Project,Well_description,Lane,barcode_id
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know.

mtag = tree.getroot().tag.split('}')[0] + '}'
instrument_text = (
f'{mtag}ExperimentContainer/{mtag}Runs/{mtag}Run')
run = ET.parse(run_info).find(instrument_text)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you; this made me realize that we could use the same method and replace: get_date_from_run_id.

@antgonza antgonza changed the title [WIP] PacBio PacBio Sep 10, 2025
@AmandaBirmingham AmandaBirmingham self-requested a review September 10, 2025 21:06
@antgonza
Copy link
Member Author

Thank you for all the help @AmandaBirmingham

@antgonza antgonza merged commit 7f33624 into qiita-spots:main Sep 10, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants